code unit
UTF-8 Plumbing: Byte-level Tokenizers Unavoidably Enable LLMs to Generate Ill-formed UTF-8
Firestone, Preston, Ugare, Shubham, Singh, Gagandeep, Misailovic, Sasa
Subword tokenization segments input text according to a pre-defined vocabulary to feed it into a language model; the language model, in turn, generates a sequence made from this same vocabulary. The members of the vocabulary can be built of code points or bytes. Using code points means that all members of the vocabulary are valid UTF-8 characters. However, it also requires thousands of initial members to achieve acceptable coverage of inputs. Beginning with bytes, on the contrary, avoids out-of-vocabulary errors with only 256 initial members of the vocabulary, but the members of the vocabulary and sequences of them are not guaranteed to be valid UTF-8. Sequences that are not valid UTF-8 break code that assumes its input to be valid UTF-8. Applications of language models must account for the breakage thereby introduced. In this paper, we formalize tokenization using monoid theory and prove that tokenizers whose vocabularies contain tokens that are ill-formed UTF-8 can always produce sequences that are ill-formed UTF-8. We demonstrate formally that attempting to incrementally convert tokens back to a string and interpret the results as UTF-8 gives different results than converting the whole sequence of tokens at once. This formal result predicts real-world bugs: we evaluate mitigations for the problem identified and provide case studies of major foundation models, serving engines, and constrained generation systems.
- North America > United States > Florida > Miami-Dade County > Miami (0.04)
- Asia > Middle East > Jordan (0.04)
- South America > Colombia > Meta Department > Villavicencio (0.04)
- (14 more...)
Project-Level C-to-Rust Translation via Synergistic Integration of Knowledge Graphs and Large Language Models
Yuan, Zhiqiang, Mao, Wenjun, Chen, Zhuo, Shang, Xiyue, Wang, Chong, Lou, Yiling, Peng, Xin
Translating C code into safe Rust is an effective way to ensure its memory safety. Compared to rule-based translation which produces Rust code that remains largely unsafe, LLM-based methods can generate more idiomatic and safer Rust code because LLMs have been trained on vast amount of human-written idiomatic code. Although promising, existing LLM-based methods still struggle with project-level C-to-Rust translation. They typically partition a C project into smaller units (\eg{} functions) based on call graphs and translate them bottom-up to resolve program dependencies. However, this bottom-up, unit-by-unit paradigm often fails to translate pointers due to the lack of a global perspective on their usage. To address this problem, we propose a novel C-Rust Pointer Knowledge Graph (KG) that enriches a code-dependency graph with two types of pointer semantics: (i) pointer-usage information which record global behaviors such as points-to flows and map lower-level struct usage to higher-level units; and (ii) Rust-oriented annotations which encode ownership, mutability, nullability, and lifetime. Synthesizing the \kg{} with LLMs, we further propose \ourtool{}, which implements a project-level C-to-Rust translation technique. In \ourtool{}, the \kg{} provides LLMs with comprehensive pointer semantics from a global perspective, thus guiding LLMs towards generating safe and idiomatic Rust code from a given C project. Our experiments show that \ourtool{} reduces unsafe usages in translated Rust by 99.9\% compared to both rule-based translation and traditional LLM-based rewriting, while achieving an average 29.3\% higher functional correctness than those fuzzing-enhanced LLM methods.
- North America > United States > California > San Francisco County > San Francisco (0.14)
- Oceania > Australia > Victoria > Melbourne (0.04)
- Asia > Singapore (0.04)
- (9 more...)
Commenting Higher-level Code Unit: Full Code, Reduced Code, or Hierarchical Code Summarization
Sun, Weisong, Zhang, Yiran, Zhu, Jie, Wang, Zhihui, Fang, Chunrong, Zhang, Yonglong, Feng, Yebo, Huang, Jiangping, Wang, Xingya, Jin, Zhi, Liu, Yang
Commenting code is a crucial activity in software development, as it aids in facilitating future maintenance and updates. To enhance the efficiency of writing comments and reduce developers' workload, researchers has proposed various automated code summarization (ACS) techniques to automatically generate comments/summaries for given code units. However, these ACS techniques primarily focus on generating summaries for code units at the method level. There is a significant lack of research on summarizing higher-level code units, such as file-level and module-level code units, despite the fact that summaries of these higher-level code units are highly useful for quickly gaining a macro-level understanding of software components and architecture. To fill this gap, in this paper, we conduct a systematic study on how to use LLMs for commenting higher-level code units, including file level and module level. These higher-level units are significantly larger than method-level ones, which poses challenges in handling long code inputs within LLM constraints and maintaining efficiency. To address these issues, we explore various summarization strategies for ACS of higher-level code units, which can be divided into three types: full code summarization, reduced code summarization, and hierarchical code summarization. The experimental results suggest that for summarizing file-level code units, using the full code is the most effective approach, with reduced code serving as a cost-efficient alternative. However, for summarizing module-level code units, hierarchical code summarization becomes the most promising strategy. In addition, inspired by the research on method-level ACS, we also investigate using the LLM as an evaluator to evaluate the quality of summaries of higher-level code units. The experimental results demonstrate that the LLM's evaluation results strongly correlate with human evaluations.
- Research Report > New Finding (1.00)
- Research Report > Experimental Study (0.93)
- Health & Medicine > Therapeutic Area > Infections and Infectious Diseases (0.34)
- Health & Medicine > Therapeutic Area > Immunology (0.34)
- Health & Medicine > Epidemiology (0.34)
Advancing Neuromorphic Computing: Mixed-Signal Design Techniques Leveraging Brain Code Units and Fundamental Code Units
Isik, Murat, Miziev, Sols, Pawlak, Wiktoria, Howard, Newton
This paper introduces a groundbreaking digital neuromorphic architecture that innovatively integrates Brain Code Unit (BCU) and Fundamental Code Unit (FCU) using mixedsignal design methodologies. Leveraging open-source datasets and the latest advances in materials science, our research focuses on enhancing the computational efficiency, accuracy, and adaptability of neuromorphic systems. The core of our approach lies in harmonizing the precision and scalability of digital systems with the robustness and energy efficiency of analog processing. Through experimentation, we demonstrate the effectiveness of our system across various metrics. The BCU achieved an accuracy of 88.0% and a power efficiency of 20.0 GOP/s/W, while the FCU recorded an accuracy of 86.5% and a power efficiency of 18.5 GOP/s/W. Our mixed-signal design approach significantly improved latency and throughput, achieving a latency as low as 0.75 ms and throughput up to 213 TOP/s. These results firmly establish the potential of our architecture in neuromorphic computing, providing a solid foundation for future developments in this domain. Our study underscores the feasibility of mixedsignal neuromorphic systems and their promise in advancing the field, particularly in applications requiring high efficiency and adaptability
- Europe > United Kingdom > England > Oxfordshire > Oxford (0.14)
- North America > United States > Rhode Island > Providence County > Providence (0.04)
- North America > United States > California > Santa Clara County > Stanford (0.04)
- North America > United States > California > Santa Clara County > Palo Alto (0.04)
- Overview (0.93)
- Research Report (0.82)
- Health & Medicine > Therapeutic Area > Neurology (1.00)
- Semiconductors & Electronics (0.68)
Discovering Program Topoi Through Clustering
Ieva, Carlo (Simula Research Laboratory) | Gotlieb, Arnaud (Simula Research Laboratory) | Kaci, Souhila (LIRMM, University of Montpellier) | Lazaar, Nadjib (LIRMM, University of Montpellier)
Understanding source code of large open-source software projects is very challenging when there is only little documentation. New developers face the task of classifying a huge number of files and functions without any help. This paper documents a novel approach to this problem, called FEAT, that automatically extracts topoi from source code by using hierarchical agglomerative clustering. Program topoi summarize the main capabilities of a software system by presenting to developers clustered lists of functions together with an index of their relevant words. The clustering method used in FEAT exploits a new hybrid distance which combines both textual and structural elements automatically extracted from source code and comments. The experimental evaluation of FEAT shows that this approach is suitable to understand open-source software projects of size approaching 2,000 functions and 150 files, which opens the door for its deployment in the open-source community.
- North America > United States > New York > New York County > New York City (0.04)
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
- Europe > Norway > Eastern Norway > Oslo (0.04)
- Europe > France > Occitanie > Hérault > Montpellier (0.04)
- Research Report (0.67)
- Workflow (0.46)
- Overview > Innovation (0.34)
- Information Technology > Software (1.00)
- Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
- Information Technology > Artificial Intelligence > Natural Language (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)
Developing Population Codes by Minimizing Description Length
Zemel, Richard S., Hinton, Geoffrey E.
The Minimum Description Length principle (MDL) can be used to train the hidden units of a neural network to extract a representation that is cheap to describe but nonetheless allows the input to be reconstructed accurately. We show how MDL can be used to develop highly redundant population codes. Each hidden unit has a location in a low-dimensional implicit space. If the hidden unit activities form a bump of a standard shape in this space, they can be cheaply encoded by the center ofthis bump. So the weights from the input units to the hidden units in an autoencoder are trained to make the activities form a standard bump.
- North America > Canada > Ontario > Toronto (0.15)
- North America > United States > California > San Francisco County > San Francisco (0.14)
- Asia > Singapore (0.04)
Developing Population Codes by Minimizing Description Length
Zemel, Richard S., Hinton, Geoffrey E.
The Minimum Description Length principle (MDL) can be used to train the hidden units of a neural network to extract a representation thatis cheap to describe but nonetheless allows the input to be reconstructed accurately. We show how MDL can be used to develop highly redundant population codes. Each hidden unit has a location in a low-dimensional implicit space. If the hidden unit activities form a bump of a standard shape in this space, they can be cheaply encoded by the center ofthis bump. So the weights from the input units to the hidden units in an autoencoder are trained to make the activities form a standard bump.
- North America > Canada > Ontario > Toronto (0.15)
- North America > United States > California > San Francisco County > San Francisco (0.14)
- Asia > Singapore (0.04)